feat: add gsd:plan-review-convergence command#2339
feat: add gsd:plan-review-convergence command#2339caius-kong wants to merge 6 commits intogsd-build:mainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThe changes introduce a contract-based architecture for plan convergence loops. The workflow now enforces review agents to return structured Changes
Sequence DiagramsequenceDiagram
participant Orchestrator
participant ReviewAgent as Review Agent
participant PlanAgent as Plan Agent
participant User
Orchestrator->>Orchestrator: Init: load max_cycles, phase
alt No plans exist
Orchestrator->>PlanAgent: spawn gsd-plan-phase
PlanAgent-->>Orchestrator: PLAN.md created
end
loop Each cycle <= max_cycles
Orchestrator->>ReviewAgent: spawn gsd-review (--codex/etc)
ReviewAgent->>ReviewAgent: analyze plans
ReviewAgent-->>Orchestrator: return message + CYCLE_SUMMARY<br/>(current_high=N)
Orchestrator->>Orchestrator: parse CYCLE_SUMMARY, extract HIGH_COUNT
alt HIGH_COUNT == 0
Orchestrator->>Orchestrator: converged!<br/>update STATE.md, exit
else
Orchestrator->>Orchestrator: check stall:<br/>HIGH_COUNT >= prev_count?
alt HIGH_COUNT > 0 AND cycle >= max_cycles
Orchestrator->>User: escalation gate:<br/>proceed/manual review
User-->>Orchestrator: user choice
else
Orchestrator->>PlanAgent: spawn gsd-plan-phase<br/>--reviews --skip-research
PlanAgent-->>Orchestrator: revised PLAN.md
Orchestrator->>Orchestrator: next cycle
end
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@commands/gsd/plan-review-convergence.md`:
- Line 4: Update the argument-hint string to include the missing documented
flags so it's aligned with the CLI docs: change the existing argument-hint value
(the "argument-hint" entry in this file) from "<phase> [--codex] [--gemini]
[--all] [--max-cycles N]" to include the additional options (for example
"<phase> [--codex] [--gemini] [--claude] [--opencode] [--text] [--ws] [--all]
[--max-cycles N]"), and apply the same edit to the other argument-hint
occurrences noted elsewhere in the file (the repeated argument-hint blocks
around lines referenced in the comment).
In `@get-shit-done/workflows/plan-review-convergence.md`:
- Line 63: The markdown contains multiple fenced code blocks using ``` without
language tags which trips MD040; update each triple-backtick fence shown in the
diff (and the additional occurrences noted in the review) to include an
appropriate language identifier (for example use ```text for plain output
blocks, ```bash for shell snippets, or ```json for JSON examples) so every
fenced block is annotated and markdownlint MD040 is satisfied.
- Around line 138-140: The HIGH_COUNT assignment can produce "0\n0" because grep
-c exits nonzero when no matches and your || echo "0" runs again; change the
HIGH_COUNT pipeline to avoid emitting a second "0" and ensure a single numeric
value (for example, run grep -c ... 2>/dev/null || true to suppress the non-zero
exit, then normalize with parameter expansion like HIGH_COUNT=${HIGH_COUNT:-0});
update the assignment that defines HIGH_COUNT (and leave HIGH_LINES as-is) so
numeric comparisons against HIGH_COUNT work correctly.
- Around line 20-25: The script calls INIT via the node gsd-tools.cjs init
plan-phase command before extracting PHASE from ARGUMENTS, causing phase
metadata (phase_dir, padded_phase, has_plans, plan_count, etc.) to be
initialized with an empty/stale phase; reorder the logic so you first parse and
validate PHASE and other flags from ARGUMENTS (extract PHASE, ensure it's
non-empty, collect REVIEWER_FLAGS, MAX_CYCLES, GSD_WS, text_mode,
response_language), then run INIT=$(node
"$HOME/.claude/get-shit-done/bin/gsd-tools.cjs" init plan-phase "$PHASE") and
parse its JSON output into phase_dir, phase_number, padded_phase, phase_name,
has_plans, plan_count, commit_docs, text_mode, response_language; add explicit
validation to fail early if PHASE is empty and ensure response_language is
handled after parsing.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 06cd5b1f-a32b-4823-a43a-6cef6891d1ff
📒 Files selected for processing (4)
CHANGELOG.mdcommands/gsd/plan-review-convergence.mdget-shit-done/workflows/plan-review-convergence.mdtests/plan-review-convergence.test.cjs
trek-e
left a comment
There was a problem hiding this comment.
Adversarial review.
New command and workflow for a cross-AI plan convergence loop. Overall well-structured. Key findings:
Issues addressed by author: All 4 CodeRabbit findings (grep -c fallback, init-before-parse ordering, argument-hint completeness, code block language identifiers) confirmed addressed in commit c401fd3.
No security issues found: No command injection vectors, no path traversal, no unsafe eval/exec patterns. Agent spawning uses Skill() calls, not raw shell.
Structural concerns:
-
No test for the HIGH-count stall detection — the test suite covers loop structure, escalation gates, and REVIEWS.md verification, but there's no test verifying that
prev_high_count == HIGH_COUNTtriggers the stall warning. This is the most fragile logic path and the one most likely to silently regress. -
--wsforwarding to review agent — the workflow spawnsSkill('gsd-review', ...)without threadingGSD_WSthrough to it. If a user is working in a named workspace, the review agent may read from the wrong workspace. This was flagged by CodeRabbit (id 3097651140) and the author addressed init ordering but it's not clear--wsis forwarded. Worth confirming. -
Max-cycles=1 edge case — with
--max-cycles 1, if the initial review finds HIGHs, the loop escalates immediately without a replan. The workflow text handles this but the test suite doesn't cover it.
trek-e
left a comment
There was a problem hiding this comment.
Adversarial review — REQUEST CHANGES.
Trek-e's prior review (2026-04-17T13:15:29Z) raised three concerns. The author addressed CodeRabbit's four findings in commit c401fd3, but trek-e's structural concerns were not addressed. The PR has no response from the author on those points and no subsequent commits that cover them.
Unresolved concerns:
1. --ws forwarding to review agent (correctness bug)
The convergence workflow spawns Skill('gsd-review', args='--phase {PHASE} {REVIEWER_FLAGS}') but does not thread GSD_WS through to the review agent. If the user is working in a named workspace (--ws project-name), the review agent reads from the wrong workspace context. The replan agent does pass {GSD_WS} — making the behavior asymmetric: planning is workspace-aware, reviewing is not. The review agent must also receive the workspace flag.
2. Missing stall detection test
The test suite has 7 suites covering command structure, init, planning gate, convergence loop, stall detection, escalation gate, and artifact verification. The stall detection suite tests that prev_high_count is tracked and that a warning is emitted — but no test verifies the condition HIGH_COUNT >= prev_high_count triggers the warning when the count does not decrease between cycles. The existing stall tests are structural (string presence) rather than behavioral. This is the most fragile path: a refactor of the stall variable names or comparison would break stall detection silently with no test catching it.
3. --max-cycles 1 edge case
With --max-cycles 1, a cycle 1 review that finds HIGHs immediately hits the escalation gate without a replan attempt. The workflow handles this correctly in prose, but the test suite has no coverage for this boundary. Users who set --max-cycles 1 as a dry-run/review-only mode will hit behavior that is untested.
Required fixes before approval:
- Thread
GSD_WSthrough to the review agent spawn call inplan-review-convergence.md - Add a behavioral stall detection test: simulate decreasing HIGH count (no stall warning) and stable HIGH count (stall warning emitted)
- Add a
--max-cycles 1test covering the immediate escalation path
The spec, overall structure, and orchestrator-only design are correct. These are targeted gaps, not architectural problems.
|
CodeRabbit finding audit — all four findings were valid and are confirmed addressed. The most recent CodeRabbit review on this PR (run ID
The trek-e adversarial review (CHANGES_REQUESTED, 2026-04-20) has three structural concerns that are separate from the CodeRabbit findings and remain open: (1) |
|
Thanks for the thorough review @trek-e. All three concerns are addressed in commit 65c8c23: 1. 2. Behavioral stall detection tests — added. The suite now includes:
3.
Bonus fix (CYCLE_SUMMARY contract): In real-world validation on a phase with 3 review cycles, I discovered the raw-grep approach in cycle 2 triggered a false stall: the reviewer's cycle 2 output re-listed resolved HIGHs for audit, inflating the Fixed by mirroring Test count: 27 → 39. All green locally ( |
65c8c23 to
0cd8f81
Compare
|
@caius-kong — the three concerns from the previous review are addressed. The CYCLE_SUMMARY contract for extracting HIGH count from the agent's return message (rather than re-grepping REVIEWS.md) is the correct design choice — REVIEWS.md accumulates historical content across cycles so a raw grep would false-stall on resolved issues that are still mentioned for audit purposes. This PR just needs a rebase to resolve the merge conflict before it can land. Once rebased on main, it should be ready to merge. |
|
The rebase is already done — commit |
|
it says right on the screen that there are conflicts, needs to be resolved. |
Cross-AI plan convergence loop that automates the plan→review→replan cycle until no HIGH concerns remain or max cycles (default 3) reached. - Orchestrator-only loop control: spawns isolated Agents for existing gsd-plan-phase and gsd-review Skills — no new planning/review logic - Stall detection: warns immediately when HIGH count not decreasing - Escalation gate: AskUserQuestion when max cycles reached with HIGHs - TEXT_MODE fallback for plain-text escalation (Copilot compatibility) - vscode_askquestions runtime note for VS Code Copilot New files: commands/gsd/plan-review-convergence.md get-shit-done/workflows/plan-review-convergence.md tests/plan-review-convergence.test.cjs (27 tests, all passing) Closes gsd-build#2306 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- argument-hint: add missing --claude, --opencode, --text, --ws flags - workflow step ordering: parse ARGUMENTS before calling gsd-tools init so PHASE is populated when init plan-phase is invoked - HIGH_COUNT: use || true + parameter expansion to avoid "0\n0" when grep -c finds no matches and exits nonzero - MD040: add language tags to all fenced code blocks (bash/text/js) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
gsd-build#2306) Addresses the three structural concerns from trek-e's adversarial review and also fixes a latent cycle-awareness bug in HIGH detection. Changes: 1. Thread {GSD_WS} through to review agent spawn call (symmetric with replan). Previously review agent ran in default workspace when user passed --ws, while replan agent correctly received the flag. This was a real correctness bug for named workspaces. 2. Replace raw grep of REVIEWS.md with CYCLE_SUMMARY contract parsed from the review agent's return message. This mirrors gsd-plan-phase's pattern of parsing structured data from the checker's return (not re-reading the artifact file). Rationale: REVIEWS.md accumulates historical content across cycles. A cycle 2 review that lists resolved HIGHs for audit would inflate the raw grep count and falsely trigger stall detection, even though convergence was actually progressing. CYCLE_SUMMARY asks the reviewer to report ONLY current unresolved HIGHs (excluding resolved, historical refs, retrospective summary tables). Fail-loud if the contract is violated — no silent defaulting. 3. Strengthen test suite from 27 → 39 tests: - BEHAVIORAL stall detection tests: verify the >= condition correctly classifies decreasing/stable/increasing HIGH transitions (catches logic inversions or > vs >= regressions). - BEHAVIORAL --max-cycles 1 edge case: verify cycle=1/max=1/HIGHs>0 immediately escalates without replan, and cycle=1/max=1/HIGHs=0 converges via the HIGH_COUNT==0 branch. - New structural tests: step ordering (Parse ARGUMENTS before init), --ws forwarding to review agent, --ws forwarding to replan agent, CYCLE_SUMMARY regex presence, CYCLE_SUMMARY exclusion rules, CYCLE_SUMMARY contract fail-loud abort. Closes gsd-build#2306 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR gsd-build#2595 landed a repo-wide slash-prefix rename while this PR was open. The rebase took our workflow wholesale (preserving CYCLE_SUMMARY + 10 extra tests that Logan's gsd-build#2390 copy lacked), so the rename needs to be reapplied to the 6 affected lines (spawn prompts + "next steps" text). gsd-tools.cjs is a filesystem path, not a slash command — left alone. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
0cd8f81 to
4c6367e
Compare
|
You're right — apologies, my prior reply was based on a stale mergeability snapshot. Re-investigated and found the actual picture: Root cause of the recurring conflict:
Rebase now done (
Status now: |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
get-shit-done/workflows/plan-review-convergence.md (1)
274-274:⚠️ Potential issue | 🟡 MinorUpdate success criteria to reflect CYCLE_SUMMARY parsing.
The success criteria states "grep HIGHs" but the workflow now extracts HIGH counts from the review agent's CYCLE_SUMMARY contract (lines 157-166), not by grepping REVIEWS.md. This inconsistency could confuse implementers.
📋 Suggested fix
-- [ ] Orchestrator only does: init, loop control, grep HIGHs, stall detection, escalation +- [ ] Orchestrator only does: init, loop control, parse CYCLE_SUMMARY for HIGH count, stall detection, escalation🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@get-shit-done/workflows/plan-review-convergence.md` at line 274, The success criteria line stating "grep HIGHs" is outdated; update the Orchestrator success criteria to reflect that it should parse the review agent's CYCLE_SUMMARY contract (see CYCLE_SUMMARY parsing logic referenced around lines 157-166) to extract HIGH counts instead of grepping REVIEWS.md, and clarify that the Orchestrator only handles init, loop control, stall detection, extraction of HIGH counts from CYCLE_SUMMARY, and escalation.
🧹 Nitpick comments (3)
get-shit-done/workflows/plan-review-convergence.md (3)
163-164: Consider validating the "Current HIGH Concerns" section format.The contract requires both CYCLE_SUMMARY (line 132) and the "## Current HIGH Concerns" section (line 141), but the abort behavior (line 163) only validates CYCLE_SUMMARY. An agent could include a valid CYCLE_SUMMARY but omit or misformat the concerns section, leading to empty
HIGH_LINESduring escalation display (lines 208, 222).While this won't break control flow (HIGH_COUNT drives the loop, not HIGH_LINES), it degrades the user experience at escalation by showing no concern details.
🛡️ Suggested validation enhancement
- `HIGH_COUNT`: the integer matched by regex `CYCLE_SUMMARY:\s+current_high=(\d+)`. If no match, abort with: `Review agent did not honor the CYCLE_SUMMARY contract — cannot determine HIGH count. Retry or switch reviewer.` - `HIGH_LINES`: the bullets under the `## Current HIGH Concerns` section of the return message (up to the next `##` heading or end of message). If the section contains only `None.`, set `HIGH_LINES=""`. + +If `HIGH_COUNT > 0` and `HIGH_LINES` is empty (section missing or malformed), warn: `Review agent's CYCLE_SUMMARY reports ${HIGH_COUNT} HIGHs but did not provide ## Current HIGH Concerns section — contract partially violated. Continuing with incomplete escalation details.`🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@get-shit-done/workflows/plan-review-convergence.md` around lines 163 - 164, Add a validation step after extracting CYCLE_SUMMARY/HIGH_COUNT that also verifies the "## Current HIGH Concerns" section was present and parsed into HIGH_LINES (or explicitly contains only "None."); if the section is missing or malformed (i.e., HIGH_LINES undefined or not a string/empty when not "None."), abort with a clear error similar to the CYCLE_SUMMARY check (e.g., "Review agent did not honor the CYCLE_SUMMARY contract — cannot determine HIGH concerns. Retry or switch reviewer.") so the workflow doesn't proceed with absent/misformatted HIGH_LINES; reference the CYCLE_SUMMARY, HIGH_COUNT and HIGH_LINES extraction/parsing logic and the "## Current HIGH Concerns" section when implementing this check.
134-136: Clarify "PARTIALLY RESOLVED" definition to ensure consistent counting.The INCLUDE list mentions "PARTIALLY RESOLVED HIGHs" but doesn't define what distinguishes "partially resolved" from "fully resolved." Different review agents or cycles might interpret this differently, leading to inconsistent HIGH counts that could trigger false stall detection.
Consider adding a brief definition or criteria, such as:
- PARTIALLY RESOLVED: concern acknowledged with mitigation in progress but not yet verified/completed
- FULLY RESOLVED: concern addressed with verification complete
📝 Suggested clarification
Where <N> counts HIGH-severity concerns that REMAIN UNRESOLVED in this cycle's findings: - - INCLUDE: newly raised HIGHs; PARTIALLY RESOLVED HIGHs; previously raised and still unresolved HIGHs + - INCLUDE: newly raised HIGHs; PARTIALLY RESOLVED HIGHs (acknowledged but mitigation incomplete); previously raised and still unresolved HIGHs - EXCLUDE: explicitly RESOLVED/FULLY RESOLVED HIGHs; HIGH mentions in retrospective/summary tables comparing cycles; quoted excerpts from prior reviews🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@get-shit-done/workflows/plan-review-convergence.md` around lines 134 - 136, Add clear, short definitions for the terms "PARTIALLY RESOLVED" and "FULLY RESOLVED" next to the INCLUDE/EXCLUDE guidance so reviewers count HIGHs consistently; specify that PARTIALLY RESOLVED means the HIGH is acknowledged and a mitigation or fix is in progress but not yet verified/completed (include expected evidence like an open ticket or remediation plan), and that FULLY RESOLVED means the issue is addressed and verification/closure steps are complete (include evidence such as closed ticket, verification log, or reviewer sign-off). Reference the existing INCLUDE list wording ("PARTIALLY RESOLVED HIGHs" and "explicitly RESOLVED/FULLY RESOLVED HIGHs") and insert the two one-line criteria immediately after that list.
163-163: Consider adding validation for malformed CYCLE_SUMMARY values.The current abort message "did not honor the CYCLE_SUMMARY contract" covers both missing and malformed cases, but distinguishing between them could help with debugging. For example, if the agent writes
CYCLE_SUMMARY: current_high=unknown, the error could indicate format problems rather than complete absence.🔍 Enhanced error reporting
-If no match, abort with: `Review agent did not honor the CYCLE_SUMMARY contract — cannot determine HIGH count. Retry or switch reviewer.` +If no match: + - Check if a CYCLE_SUMMARY line exists (even if malformed): hint at format issues + - Otherwise: report completely missing contract +Abort with: `Review agent did not honor the CYCLE_SUMMARY contract — cannot determine HIGH count. Retry or switch reviewer.`🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@get-shit-done/workflows/plan-review-convergence.md` at line 163, The parsing of HIGH_COUNT using the regex `CYCLE_SUMMARY:\s+current_high=(\d+)` should distinguish between a missing CYCLE_SUMMARY and a malformed value; update the code that currently aborts with "Review agent did not honor the CYCLE_SUMMARY contract — cannot determine HIGH count. Retry or switch reviewer." to first check for a missing match and keep the existing abort message, and if there is a match but the captured value does not parse to an integer (or fails validation), abort with a different, explicit message like "CYCLE_SUMMARY present but current_high is malformed — expected integer, got '<value>'", referencing the `HIGH_COUNT` extraction logic so callers can debug formatting vs absence.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@get-shit-done/workflows/plan-review-convergence.md`:
- Around line 157-166: The workflow lacks a concrete parser for the agent's
return message; implement extraction logic that reads the agent output
(AGENT_RETURN), captures HIGH_COUNT by matching the CYCLE_SUMMARY line (regex
for current_high=\d+), validates it is a non-empty integer and aborts with the
exact message "Review agent did not honor the CYCLE_SUMMARY contract — cannot
determine HIGH count. Retry or switch reviewer." if missing or invalid, and
extracts HIGH_LINES as the bullet lines under the "## Current HIGH Concerns"
section (stop at the next "##" or end of message), treating a section that
contains only "None." as HIGH_LINES=""; ensure robustness to minor formatting by
trimming whitespace and handling missing sections consistently.
---
Outside diff comments:
In `@get-shit-done/workflows/plan-review-convergence.md`:
- Line 274: The success criteria line stating "grep HIGHs" is outdated; update
the Orchestrator success criteria to reflect that it should parse the review
agent's CYCLE_SUMMARY contract (see CYCLE_SUMMARY parsing logic referenced
around lines 157-166) to extract HIGH counts instead of grepping REVIEWS.md, and
clarify that the Orchestrator only handles init, loop control, stall detection,
extraction of HIGH counts from CYCLE_SUMMARY, and escalation.
---
Nitpick comments:
In `@get-shit-done/workflows/plan-review-convergence.md`:
- Around line 163-164: Add a validation step after extracting
CYCLE_SUMMARY/HIGH_COUNT that also verifies the "## Current HIGH Concerns"
section was present and parsed into HIGH_LINES (or explicitly contains only
"None."); if the section is missing or malformed (i.e., HIGH_LINES undefined or
not a string/empty when not "None."), abort with a clear error similar to the
CYCLE_SUMMARY check (e.g., "Review agent did not honor the CYCLE_SUMMARY
contract — cannot determine HIGH concerns. Retry or switch reviewer.") so the
workflow doesn't proceed with absent/misformatted HIGH_LINES; reference the
CYCLE_SUMMARY, HIGH_COUNT and HIGH_LINES extraction/parsing logic and the "##
Current HIGH Concerns" section when implementing this check.
- Around line 134-136: Add clear, short definitions for the terms "PARTIALLY
RESOLVED" and "FULLY RESOLVED" next to the INCLUDE/EXCLUDE guidance so reviewers
count HIGHs consistently; specify that PARTIALLY RESOLVED means the HIGH is
acknowledged and a mitigation or fix is in progress but not yet
verified/completed (include expected evidence like an open ticket or remediation
plan), and that FULLY RESOLVED means the issue is addressed and
verification/closure steps are complete (include evidence such as closed ticket,
verification log, or reviewer sign-off). Reference the existing INCLUDE list
wording ("PARTIALLY RESOLVED HIGHs" and "explicitly RESOLVED/FULLY RESOLVED
HIGHs") and insert the two one-line criteria immediately after that list.
- Line 163: The parsing of HIGH_COUNT using the regex
`CYCLE_SUMMARY:\s+current_high=(\d+)` should distinguish between a missing
CYCLE_SUMMARY and a malformed value; update the code that currently aborts with
"Review agent did not honor the CYCLE_SUMMARY contract — cannot determine HIGH
count. Retry or switch reviewer." to first check for a missing match and keep
the existing abort message, and if there is a match but the captured value does
not parse to an integer (or fails validation), abort with a different, explicit
message like "CYCLE_SUMMARY present but current_high is malformed — expected
integer, got '<value>'", referencing the `HIGH_COUNT` extraction logic so
callers can debug formatting vs absence.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: d373578b-7291-41dd-aa81-42549ce38d2b
📒 Files selected for processing (1)
get-shit-done/workflows/plan-review-convergence.md
…ria (gsd-build#2306) Addresses CodeRabbit review on 4c6367e: 1. Success criteria in <success_criteria> still said "grep HIGHs" after we migrated to the CYCLE_SUMMARY contract. Updated to "parse CYCLE_SUMMARY for HIGH count" so implementers are not misled by the obsolete description. 2. CodeRabbit (Major) asked for concrete bash parsing code for the CYCLE_SUMMARY extraction. Clarified in a new "Parsing layer:" note that this extraction is performed by the orchestrator LLM on the Agent's return message, not a shell pipeline — there is no $AGENT_RETURN variable for grep/sed to consume, the return message is in-context for the orchestrator. This is the same data-flow shape gsd-plan-phase uses with gsd-plan-checker's return. Expressing it as shell code would imply a runtime that does not exist here. Tests: 39/39 still green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Thanks @coderabbitai. Both findings addressed on the current branch tip: Finding 1 (success criteria says "grep HIGHs") — valid, fixed. Updated Finding 2 (concrete implementation for CYCLE_SUMMARY parsing) — not taking the bash suggestion. The extraction targets the Agent's return message text (in-context for the orchestrator LLM), not a file or shell variable. There is no Also a secondary concern with the suggested snippet: Tests: 39/39 still green. |
|
Only users with a collaborator, contributor, member, or owner role can interact with CodeRabbit. |
…parsing (gsd-build#2306) Added a defensive "Parsing layer:" paragraph in e38f66d to tell readers not to look for a $AGENT_RETURN bash variable. On review, this note was addressing a concern the surrounding text does not actually create: "Parse the agent's final return message" and the regex-based extraction rules already make clear this is not a shell pipeline, and no other section mentions $AGENT_RETURN. The note was appeasement for the bot review, not clarification for a human reader. Keeping the success criteria sync from e38f66d — that one was a real inconsistency (obsolete "grep HIGHs" after we moved to CYCLE_SUMMARY). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Only users with a collaborator, contributor, member, or owner role can interact with CodeRabbit. |
|
Summary of this round — PR is now ready:
Status: |
trek-e
left a comment
There was a problem hiding this comment.
Adversarial review — requesting changes
This is a cross-AI plan convergence loop with replanning. I'm evaluating the iteration on top of #2306 as shipped, plus what this PR actually changes (160/72 touching workflow + tests; CYCLE_SUMMARY contract, --ws symmetry, behavioral tests, slash-prefix rename).
Gall's Law — was this evolved from something simpler?
Yes, and visibly. Commit history shows this started as grep-of-REVIEWS.md, then trek-e's review + a latent cycle-awareness bug drove the move to a CYCLE_SUMMARY return contract. That is a legitimate evolutionary step from a working simpler system; credit where due. But the system is now simultaneously (a) a shell orchestrator writing $HIGH_COUNT-style pseudocode and (b) an LLM-level parsing layer — and the PR dropped the "Parsing layer:" note in d6b1a94 on the grounds that "surrounding text already implies LLM parsing." It does not imply that for anyone reading HIGH_COUNT=...regex match... at get-shit-done/workflows/plan-review-convergence.md:162-164, which is written as if something executes it. The note was not appeasement; the note was the one line telling a future implementer not to try to wire bash around this. Put it back.
Goodhart's Law — the termination metric is the vulnerability
"Loop until HIGH_COUNT == 0 or MAX_CYCLES" with HIGH_COUNT reported by the reviewer itself creates a target-becomes-measure: the reviewer learns (or is coached by the replanner) that downgrading HIGHs to MEDIUMs ends the loop. The CYCLE_SUMMARY contract at plan-review-convergence.md:125-142 is an honor-system instruction to an external AI — Codex/Gemini/Copilot — that has no skin in distinguishing HIGH from MEDIUM accurately. The "EXCLUDE: explicitly RESOLVED HIGHs" rule at :134 is especially exploitable: a replan that renames a concern rather than fixing it satisfies the contract. There is no guard against severity drift across cycles (e.g., tracking that no previously-HIGH became MEDIUM without a RESOLVED marker). Add one, or acknowledge this in the workflow as a known failure mode rather than shipping the metric as reliable.
Peter Principle — unbounded adversarial reviewer
If the external reviewer keeps inventing new HIGHs each cycle (not the "stall" case — the "invents different HIGHs each time" case), prev_high_count comparison at :170 only catches non-decreasing counts. 5 fresh HIGHs → 5 different fresh HIGHs reads as stall; 5 → 4 → 3 cascading churn of different concerns reads as progress and exhausts all 3 cycles. There is no "churn detection" (HIGH identity tracking across cycles). MAX_CYCLES caps wall time but not reviewer adversariality. Acceptable for this PR if called out — currently it is not.
Kernighan — state machine
Control flow is fine: 5a → 5b → 5c → 5d → 5a is linear. No clever async. prev_high_count is initialized before the loop (confirmed :170 reference), so the first-cycle stall-false-positive case is handled. This part is boring in the good way.
Knuth — cost per cycle unmeasured
3 cycles × (1 plan-phase + 1 cross-AI review + 1 replan) = ~9 LLM-agent invocations worst case, each themselves spawning sub-skills. No measurement, no budget, no --max-tokens or --max-cost escape hatch. For an XL feature whose name is "convergence loop" this is the first question an operator will ask. At minimum, log per-cycle timing so users can see what they paid.
Minor
plan-review-convergence.md:249— success_criteria still says "grep HIGHs".e38f66dwas supposed to sync this. Verify: the diff shows it was updated to "parse CYCLE_SUMMARY for HIGH count", but the file at HEAD above still reads "grep HIGHs" at line 249. Check this is actually committed.- CYCLE_SUMMARY contract failure mode at
:144is fail-loud ("abort"). Good. But there is no retry with a different reviewer — one malformed return ends the whole loop. Consider one retry before abort. tests/plan-review-convergence.test.cjs— 39 tests are all content assertions on workflow strings. These verify the prompt text exists, not that the LLM actually honors it. That is the limit of the pattern, but claiming "BEHAVIORAL" in test names (e.g.BEHAVIORAL: decreasing HIGH count...) is overclaim: those tests simulate a JS functionstallCondition, not the workflow. Rename toSIMULATED:or drop the caps.
Verdict
Changes requested. Core loop evolution is sound; the CYCLE_SUMMARY migration is a real improvement over raw grep. Blockers: (1) success_criteria line 249 still says grep, (2) no churn detection / severity-drift guard against the Goodhart attack the loop invites, (3) no cost observability for a loop that costs N×3 LLM calls, (4) restore the dropped Parsing-layer note. Address these, then this ships.
Linked Issue
Closes #2306
Feature summary
Adds
gsd:plan-review-convergence— a cross-AI plan convergence loop that automates the manualplan-phase → review → replan → re-reviewchain. The orchestrator spawns isolated Agents for existinggsd-plan-phaseandgsd-reviewSkills, then loops until no HIGH concerns remain in REVIEWS.md or max cycles (default 3) are reached.What changed
New files
commands/gsd/plan-review-convergence.mdgsd-plan-review-convergenceskillget-shit-done/workflows/plan-review-convergence.mdtests/plan-review-convergence.test.cjsModified files
CHANGELOG.md[Unreleased] → Addedentry for the new commandImplementation notes
Implementation matches the approved spec exactly. The only discretionary decisions:
<command-name>.test.cjsconventionchain-flag-plan-phase.test.cjs,code-review-command.test.cjs) — verifies workflow contains key instruction strings the LLM will read at runtime### Addedin[Unreleased]sectionSpec compliance
gsd:plan-review-convergencewith all documented flags (--codex,--gemini,--claude,--opencode,--all,--max-cycles N)--codexis the default reviewer when no flag specifiedgsd-plan-phaseif not, errors if no PLAN.md producedgsd-reviewAskUserQuestionwith "Proceed anyway" / "Manual review" optionsvscode_askquestionsruntime noteTesting
Test coverage
tests/plan-review-convergence.test.cjs— 27 tests across 7 suites:All 27 tests pass:
node --test tests/plan-review-convergence.test.cjsPlatforms tested
Runtimes tested
Scope confirmation
Checklist
Closes #2306approved-featurelabelnpm test)<objective>and<context>tagsBreaking changes
None
Screenshots / recordings
Real-world validation on Phase 04.1 (COMET Evaluation Metrics, 4 PLAN.md files):
Concerns caught: verdict math not switched to new metric, missing INCONCLUSIVE outcome paths, silent COMET model failure modes. All resolved in one replan cycle.
Summary by CodeRabbit
Release Notes
Bug Fixes
Tests